.. _opendata_analysis:
Working with SOEP Data in Open Data Format
********************************************
The SOEP offers data in the Open Data Format (ODF). Because the format carries metadata that is accessible
in Stata, R, and Python, it is particularly useful for R and Python users who want to load SOEP-core datasets
together with their metadata.
This guide demonstrates how to open and work with SOEP data in the ODF using R, Python, or Stata.

The Open Data Format
=====================
The Open Data Format (ODF) is a metadata-enriched, non-proprietary, and platform-independent data format.
It includes metadata in multiple languages stored in DDI Codebook Format. The format is specified as a
zip-compressed folder containing a CSV file with the data and an XML file with the metadata.
To import and export data in the ODF format, the SOEP provides packages for Stata, R, and Python.
For further information on the Open Data Format, visit the `ODF website `_.
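
The container layout described above can be illustrated with the Python standard library alone.
The following sketch builds a tiny zip archive in memory and reads it back. The member names
`data.csv` and `metadata.xml` and the toy XML content are assumptions made for illustration; the
XML is not a valid DDI Codebook document. For real ODF files, use the packages described in this guide.

.. code-block:: python

    import csv
    import io
    import zipfile

    # Build a toy archive in memory. Member names and contents are
    # illustrative assumptions, not a valid ODF/DDI file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.csv", "id,bap87\n1,2\n2,1\n")
        archive.writestr("metadata.xml", "<codeBook><!-- toy metadata --></codeBook>")

    # Read the archive back: list its members and parse the CSV part.
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        members = sorted(archive.namelist())
        with archive.open("data.csv") as handle:
            rows = list(csv.DictReader(io.TextIOWrapper(handle, encoding="utf-8")))

    print(members)
    print(rows[0])

Because both parts are plain text inside a standard zip archive, the data remain readable without
any dedicated software, although the metadata is then not attached to the variables.
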
Using the ODF in R
======================
**Installing the Opendataformat Package in R**
To get started with SOEP data in the Open Data Format, install the `opendataformat` package from
CRAN using `install.packages()` and then load it.

.. code:: r

    install.packages("opendataformat")
    library("opendataformat")

**Read a Data File**
Now you can use the package functions to work with ODF data files. For demonstration purposes, let's load the sample data included in the `opendataformat` package.
To open the documentation for the `read_odf()` function, execute `?read_odf`.

.. code:: r

    path <- system.file("extdata", "data.zip", package="opendataformat")
    ?read_odf
    df <- read_odf(file = path)

You can specify further parameters of the `read_odf()` function: the languages in which metadata should
be imported, the maximum number of rows to import, the number of rows to skip
(excluding the header), and the variables/columns to import (indices or column names).
The following line shows the default parameters:

.. code:: r

    df <- read_odf(file = path, languages = "all", nrows = Inf, skip = 0, select = NULL)

**Display and Use Metadata**
To display the metadata of a dataset or a variable in the viewer, use the `docu_odf()` function.

.. code:: r

    # Display dataset information
    docu_odf(df)

.. figure:: png/opendataformat1_docu.png
    :align: center
    :alt: R viewer output of docu odf for dataset
    :scale: 60%

You can also display information for a specific variable:

.. code:: r

    # Display variable information
    docu_odf(df$bap87)

.. figure:: png/opendataformat2_docu.png
    :align: center
    :alt: R viewer output of docu odf for a variable
    :scale: 60%

You can also choose the language of the displayed metadata, provided the information is available in that language,
by setting the `languages` parameter to a language code (e.g. `languages = "de"`), or display the metadata in all available
languages (`languages = "all"`). The default is the current language, which is normally English. You can set the default
language of a dataframe in ODF format using the `setlanguage_odf()` function (e.g. `df <- setlanguage_odf(df, "de")`).

.. code:: r

    # Display the metadata in the console with style = "print" or style = "console",
    # or in both the viewer and the console with style = "both".
    docu_odf(df$bap87, style = "print")
    # Choose the language of the metadata (if the language is available).
    docu_odf(df$bap87, languages = "de")
    # Display the metadata in all available languages.
    docu_odf(df$bap87, languages = "all")
    # Set the default language of a dataset.
    df <- setlanguage_odf(df, "de")
    # Variable information is then displayed in that language by default:
    docu_odf(df$bap87)

The function `read_odf()` reads the data into R as a tibble (dataframe).
The metadata is stored in the attributes of the ODF tibble and in the attributes of its columns.
To retrieve attributes, you can use the base R functions `attr()` and `attributes()`.

.. code:: r

    # Display all attributes of a dataframe
    attributes(df)
    # Display all attributes of a variable/column
    attributes(df$bap87)
    # Display a specific attribute of a dataframe
    attributes(df)$label_de
    attr(df, "label_de")
    # Display a specific attribute of a variable
    attributes(df$bap87)$labels_de
    attr(df$bap87, "labels_de")

Alternatively, you can use the `getmetadata_odf()` function from the `opendataformat` package to retrieve labels and other metadata:

.. code:: r

    # Display the labels of all variables in a dataframe (in the active/current language)
    getmetadata_odf(df, type = "labels")
    # Display the labels in a specific language
    getmetadata_odf(df, type = "labels", language = "de")
    # To see which languages are available for the dataframe, set type = "languages"
    getmetadata_odf(df, type = "languages")
    # Display the value labels of a specific variable
    valuelabels <- getmetadata_odf(df$bap87, type = "valuelabels")
    valuelabels
    # You can also display the descriptions, the URLs, or the variable types.
    # Descriptions of all variables:
    getmetadata_odf(df, type = "description")
    # Description of one variable:
    getmetadata_odf(df$bap87, type = "description")
    # URLs:
    getmetadata_odf(df, type = "url")
    # Variable types:
    getmetadata_odf(df, type = "type")

**Save a dataset as an ODF-File**
To save a dataset as an ODF file again, the `opendataformat` package provides the `write_odf()` function.
All metadata stored in the attributes that is compatible with the ODF specification is preserved in the ODF file.
The compatible metadata for the dataframe includes the dataset name in the `name` attribute, labels with a language tag (e.g. label_en),
descriptions with a language tag (e.g. description_en), and a URL in the `url` attribute.
The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en),
descriptions with a language tag (e.g. description_en), the variable type in the `type` attribute, value labels with a language tag (e.g. labels_en), and a URL in the `url` attribute.
You can choose to save the metadata in one or several specific languages only, using the `languages` argument. The default is `languages = "all"`.

.. code:: r

    # Write the dataframe `df` to the ODF file `my_datafile.zip` in the current working directory.
    write_odf(x = df, file = "my_datafile.zip")
    # Write the dataframe `df` while keeping only the metadata in English.
    write_odf(x = df, file = "my_datafile.zip", languages = "en")

For further instructions on how to work with the ODF-format in R, read the `vignette `_
for the `opendataformat`-package.
Using the ODF in Stata
==========================
**Installing the Opendataformat Package**
To work with SOEP data in the Open Data Format in Stata, install the `opendf` package
using `ssc install`, which downloads and installs the package from SSC.

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 4-5

**Python Integration**
The `opendf` package requires Stata's Python integration. You therefore need Stata version 16 or higher and a Python installation on your computer.
You can easily test whether Python works within Stata. If you have problems with the Python integration, you can copy a Python version to your computer using the `opendf installpython`
command from the `opendf` package:

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 9-40

**Read a Data File**
Now you can use the package commands to work with ODF data files. For demonstration purposes, we copy the test dataset (`testdata.zip`) from GitHub to the local temp directory:

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 43-44

With the `opendf read` command you can load the data file into Stata:

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 47-51

You can specify further parameters of the `opendf read` command: the range of rows to load (excluding the header),
the range of columns to load, and whether to save the dataset directly as a `.dta` file.

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 47-58

**Display and Use Metadata**
To display the dataset information, use the `opendf docu` command:

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 65-67

.. figure:: png/opendataformat3_docu.png
    :align: center
    :alt: Stata console output of opendf docu
    :scale: 75%

Or information on a specific variable:

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 69-70

.. figure:: png/opendataformat4_docu.png
    :align: center
    :alt: Stata console output of opendf docu bap87
    :scale: 75%

You can also choose the languages of the displayed metadata, provided the information is available in those languages,
by setting the `languages` option to these languages (e.g. `languages("en")`), or display the metadata in all
languages (`languages("all")`). The default is the active label language, which you can set using the `label language` command.

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 72-77

The metadata is stored in the characteristics of the dataset.
To retrieve them, you can use Stata's built-in commands:

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 79-84

**Save a Dataset as an ODF-File**
To save a dataset as an ODF file, the `opendf` package provides the `opendf write` command.
All metadata stored in the characteristics that is compatible with the ODF specification is preserved in the ODF file.
The compatible metadata for the dataset includes the dataset name in the `dataset` characteristic, labels with a language tag (e.g. label_en),
descriptions with a language tag (e.g. description_en), and a URL in the `url` characteristic.
The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en),
descriptions with a language tag (e.g. description_en), the variable type in the `type` characteristic,
value labels with a language tag (e.g. labels_en), and a URL in the `url` characteristic.
You can also specify the columns/variables to save and the languages of the metadata. By default, the metadata is saved in all available languages (`languages("all")`).

.. literalinclude:: docs/how_to_use_opendataformat.do
    :linenos:
    :lines: 87-104

Using the ODF in Python
==========================
**Installing the Opendataformat Package**
To work with SOEP data in the Open Data Format in Python, install the `opendataformat` library using `pip install`.
Currently, the development version can be installed from GitHub on the command line using `pip install git+` followed by the repository URL:

.. code-block:: bash

    pip install git+https://github.com/opendataformat/python-package-opendataformat.git

**Read an ODF Data File**
To use the `opendataformat` package, import the library in Python.

.. code-block:: python

    import opendataformat as odf

Now you can use the package functions to work with ODF data files. With the `read_odf()` function you can load the data file into Python.
For demonstration purposes, we load the example dataset from the ODF website:

.. code-block:: python

    df = odf.read_odf('https://opendataformat.github.io/files/example_dataset.zip')

You can specify further parameters of the `read_odf()` function: the number of rows to read (excluding the header),
the number of rows to skip (excluding the header), the columns to load, the metadata languages to load,
and which values should be treated as NAs.

.. code-block:: python

    df = odf.read_odf(
        path = 'https://opendataformat.github.io/files/example_dataset.zip',
        languages = "all",
        usecols = None,
        skiprows = None,
        nrows = None,
        na_values = None,
    )

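
To make the meaning of these options concrete, the following sketch reproduces the behaviour described
above on a small in-memory CSV using only the standard library. It does not call `read_odf()`; the helper
`read_rows` and the toy data are hypothetical, and the package itself may implement the options differently.

.. code-block:: python

    import csv
    import io
    from itertools import islice

    CSV_TEXT = "id,bap87,name\n1,2,a\n2,-1,b\n3,1,c\n4,2,d\n"


    def read_rows(text, usecols=None, skiprows=0, nrows=None, na_values=()):
        """Hypothetical helper mimicking the described options on plain CSV text."""
        reader = csv.DictReader(io.StringIO(text))  # the header row is consumed here
        stop = None if nrows is None else skiprows + nrows
        selected = []
        for row in islice(reader, skiprows, stop):  # counts data rows only
            if usecols is not None:
                row = {key: row[key] for key in usecols}
            # Replace every value listed in na_values with None.
            selected.append({k: (None if v in na_values else v) for k, v in row.items()})
        return selected


    subset = read_rows(CSV_TEXT, usecols=["id", "bap87"], skiprows=1, nrows=2, na_values=("-1",))
    print(subset)

Here the first data row is skipped, the next two rows are read, only two columns are kept, and the
code `-1` is treated as missing.
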
**Display and Use Metadata**
To display the metadata of a dataset, use the `docu_odf()` function.
It both displays the metadata dictionary and returns it.

.. code-block:: python

    # Display metadata for dataset and assign metadata dictionary to `metadata`
    metadata = odf.docu_odf(df)

.. figure:: png/opendataformat5_docu.png
    :align: center
    :alt: Python output of docu_odf for dataset
    :scale: 75%

You can also use `docu_odf()` to display the metadata for a variable:

.. code-block:: python

    # Display metadata for variable and assign metadata dictionary to `variable_metadata`
    variable_metadata = odf.docu_odf(df.bap87)

.. figure:: png/opendataformat6_docu.png
    :align: center
    :alt: Python output of docu_odf for a variable
    :scale: 75%

You can also display a specific metadata field in a specific language.
To choose the languages of the displayed metadata, provided the information is available in those languages,
set the `languages` parameter accordingly (e.g. `languages = "de"`).
To select a specific metadata field, set the `metadata` parameter to the respective metadata entry (e.g. `metadata = "valuelabels"`).

.. code-block:: python

    # Display value labels in German and assign them to `valuelabels_de`
    valuelabels_de = odf.docu_odf(df.bap87, metadata = "valuelabels", languages = "de")

.. figure:: png/opendataformat7_docu.png
    :align: center
    :alt: Python output of docu_odf for value labels in German
    :scale: 75%

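
Once value labels are available, they can be used to recode numeric answers into readable text.
The sketch below assumes that the value labels come as a dictionary mapping codes to label strings.
This shape, the toy labels, and the toy codes are assumptions for illustration and are not the actual
metadata of `bap87`.

.. code-block:: python

    # Toy value labels (assumed shape: code -> label); not the real labels of bap87.
    valuelabels_de = {-2: "trifft nicht zu", 1: "Ja", 2: "Nein"}

    # Toy answer codes, including one code without a label.
    codes = [1, 2, -2, 9]

    # Translate codes to labels; unlabelled codes are kept as strings.
    labelled = [valuelabels_de.get(code, str(code)) for code in codes]
    print(labelled)

Falling back to the code itself for unlabelled values keeps unexpected codes visible instead of
silently turning them into missing values.
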
**Save a Dataset as an ODF-File**
To save a pandas dataset as an ODF file, the `opendataformat` package provides the `write_odf()` function.
All metadata stored in the attributes that is compatible with the ODF specification is preserved in the ODF file.
The compatible metadata for the dataframe includes the dataset name in the `dataset` attribute, labels with a language tag (e.g. label_en),
descriptions with a language tag (e.g. description_en), and a URL in the `url` attribute.
The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en),
descriptions with a language tag (e.g. description_en), the variable type in the `type` attribute,
value labels with a language tag (e.g. labels_en), and a URL in the `url` attribute.
You can also specify the columns/variables to save and the languages of the metadata.
By default, the metadata is saved in all available languages (`languages = "all"`).

.. code-block:: python

    # Save dataset as `dataset.zip` in the current working directory
    odf.write_odf(df, 'dataset.zip')
    # Keep only the metadata for one or more specific languages using the `languages` parameter.
    odf.write_odf(df, 'dataset.zip', languages = "en")

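
The effect of restricting the languages can be pictured with a small standard-library sketch that
filters language-tagged entries from a metadata dictionary. It does not call `write_odf()`; the helper
`keep_languages`, the toy metadata, and the example URL are hypothetical, and the package may handle
languages differently internally.

.. code-block:: python

    import re

    # Toy variable metadata using the attribute names described above.
    attributes = {
        "label_en": "Current health",
        "label_de": "Gesundheitszustand gegenwärtig",
        "description_en": "Self-rated health",
        "labels_en": {1: "Very good", 2: "Good"},
        "labels_de": {1: "Sehr gut", 2: "Gut"},
        "type": "numeric",
        "url": "https://example.org/variable",
    }

    # Matches keys such as label_en, labels_de, or description_en.
    LANGUAGE_TAGGED = re.compile(r"^(label|labels|description)_([a-z]{2})$")


    def keep_languages(metadata, languages):
        """Hypothetical helper: drop language-tagged entries not in `languages`."""
        kept = {}
        for key, value in metadata.items():
            match = LANGUAGE_TAGGED.match(key)
            if match is None or match.group(2) in languages:
                kept[key] = value
        return kept


    english_only = keep_languages(attributes, {"en"})
    print(sorted(english_only))

Entries without a language tag, such as `type` and `url`, are kept regardless of the selected languages.
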
Last change: |today|